Selecting More Informative Training Sets with Fewer Observations
Authors
Abstract
A standard text-as-data workflow in the social sciences involves identifying a set of documents to be labeled, selecting a random sample of them to label using research assistants, training a supervised learner to label the remaining documents, and validating that model's performance using standard accuracy metrics. The most resource-intensive component of this is the hand-labeling: carefully reading documents, training research assistants, and paying human coders to label documents in duplicate or more. We show that hand-coding an algorithmically selected rather than a simple-random sample can improve model performance above baseline by as much as 50%, or reduce hand-coding costs by up to two-thirds, in applications predicting (1) U.S. executive-order significance and (2) financial sentiment on social media. We accompany this manuscript with open-source software to implement these tools, which we hope can make supervised learning cheaper and more accessible to researchers.
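The abstract does not say which selection algorithm the authors use. As a rough illustration of the contrast between a simple-random sample and an algorithmically selected one, the sketch below assumes uncertainty sampling, a common active-learning heuristic that sends to human coders the documents a preliminary model is least sure about. The function names and the probabilities are hypothetical and are not taken from the paper or its software.

```python
import random


def select_random(n_documents, k, seed=0):
    """Baseline: draw k document indices uniformly at random."""
    rng = random.Random(seed)
    return rng.sample(range(n_documents), k)


def select_uncertain(probabilities, k):
    """Assumed heuristic (uncertainty sampling): pick the k documents whose
    predicted positive-class probability is closest to 0.5."""
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: abs(probabilities[i] - 0.5),
    )
    return ranked[:k]


# Hypothetical predicted probabilities from a preliminary classifier.
probs = [0.02, 0.48, 0.97, 0.55, 0.10, 0.61]

random_batch = select_random(len(probs), k=3)
uncertain_batch = select_uncertain(probs, k=3)
print("random sample:   ", random_batch)
print("uncertain sample:", uncertain_batch)
```

In a full workflow the selected batch would be hand-labeled, the model retrained, and the selection repeated. Only the selection step is sketched here.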
Similar resources
Mutual information-based method for selecting informative feature sets
Feature selection is one of the fundamental problems in pattern recognition and data mining. A popular and effective approach to feature selection is based on information theory, namely the mutual information of features and class variable. In this paper we compare eight different mutual information-based feature selection methods. Based on the analysis of the comparison results, we propose a n...
Selecting maximally informative genes
Microarray experiments are emerging as one of the main driving forces in modern biology. By allowing the simultaneous monitoring of the expression of the entire genome for a given organism, array experiments provide tremendous insight into the fundamental biological processes that translate genetic information. One of the major challenges is to identify computationally efficient and biologicall...
Doing More with Fewer Bits
We present a variant of the Diffie-Hellman scheme in which the number of bits exchanged is one third of what is used in the classical Diffie-Hellman scheme, while the offered security against attacks known today is the same. We also give applications for this variant and conjecture an extension of this variant further reducing the size of sent information.
Selecting informative features with fuzzy-rough sets and its application for complex systems monitoring
One of the main obstacles facing current intelligent pattern recognition applications is that of dataset dimensionality. To enable these systems to be effective, a redundancy-removing step is usually carried out beforehand. Rough Set Theory (RST) has been used as such a dataset pre-processor with much success; however, it is reliant upon a crisp dataset; important information may be lost as a re...
Fewer adjuncts: more relatives*
Substantially agreeing with Hornstein (2009: 81), “it is fair to say that what adjuncts are and how they function grammatically is not well understood”. I refer the reader to, e.g., Hornstein 2009: chapter 4, Hornstein & Nunes 2008, Hunter 2010 or most recently Hunter 2015 for a catalog of properties and problems adjuncts raise in general, and in particular for all previous proposals including ...
Journal
Journal title: Political Analysis
Year: 2023
ISSN: 1047-1987, 1476-4989
DOI: https://doi.org/10.1017/pan.2023.19